# Clear all objects from the current workspace
rm(list = ls())
# Load necessary libraries
library(ggplot2)
#install.packages("GGally")
#library(GGally)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(boot) # for bootstrapping
library(car) # for scatterplotMatrix
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
library(MASS) # for Box-Cox transformation
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Load the dataset
admission_data <- read.csv("/Users/ileend03/Downloads/project 1/Admission_Predict.csv")
View(admission_data)
The dataset chosen for this project is used to predict the chance of graduate admission into master’s programs. It was inspired by the UCLA Graduate Dataset and is built from an Indian perspective. There are six key variables; however, this project examines only GRE Score, TOEFL Score, undergraduate GPA (CGPA), and Statement of Purpose (SOP). The target, or response, variable is Chance of Admit. The goal is to estimate models that show which variables influence a student’s chance of graduate admission, and to what extent. Each predictor is paired with one model: Model 1 uses CGPA, Model 2 uses TOEFL Score, Model 3 uses GRE Score, and Model 4 uses SOP. The dataset was found on Kaggle and was created by Mohan S Acharya, Asfia Armaan, and Aneeta S Antony. It is sourced from the paper “A Comparison of Regression Models for Prediction of Graduate Admissions,” presented in 2019 at the IEEE International Conference on Computational Intelligence in Data Science.
Citation: Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019. https://www.kaggle.com/datasets/mohansacharya/graduate-admissions/data
# Summary statistics of CGPA
summary(admission_data$CGPA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.800 8.170 8.610 8.599 9.062 9.920
summary(admission_data$Chance.of.Admit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3400 0.6400 0.7300 0.7244 0.8300 0.9700
summary(admission_data[, c("CGPA", "Chance.of.Admit")])
## CGPA Chance.of.Admit
## Min. :6.800 Min. :0.3400
## 1st Qu.:8.170 1st Qu.:0.6400
## Median :8.610 Median :0.7300
## Mean :8.599 Mean :0.7244
## 3rd Qu.:9.062 3rd Qu.:0.8300
## Max. :9.920 Max. :0.9700
sd(admission_data$CGPA)
## [1] 0.5963171
sd(admission_data$Chance.of.Admit)
## [1] 0.1426093
CGPA has a mean of 8.60 and a median of 8.61, with a minimum of 6.80 and a maximum of 9.92. The standard deviation is 0.60, which indicates that most CGPA values are clustered around the mean. For Chance of Admission, the mean is 0.72, the median is 0.73, the minimum is 0.34, and the maximum is 0.97. The standard deviation is 0.14. These values suggest that most students have a moderate to high estimated likelihood of gaining admission into graduate school.
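Because CGPA and Chance of Admission sit on different scales, one way to compare how tightly each clusters around its mean is the coefficient of variation (standard deviation divided by mean). The sketch below is not part of the original analysis; it simply reuses the means and standard deviations printed above, and the variable names are illustrative.

```r
# Means and standard deviations copied from the output above
mean_cgpa <- 8.599
sd_cgpa   <- 0.5963171
mean_coa  <- 0.7244
sd_coa    <- 0.1426093

# Coefficient of variation: spread relative to the size of the mean
cv_cgpa <- sd_cgpa / mean_cgpa
cv_coa  <- sd_coa / mean_coa

round(c(CGPA = cv_cgpa, Chance.of.Admit = cv_coa), 3)
```

The ratio for CGPA is the smaller of the two, which is consistent with CGPA values being relatively more tightly clustered around their mean.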
# Summary statistics of TOEFL Score
summary(admission_data$TOEFL.Score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 92.0 103.0 107.0 107.4 112.0 120.0
summary(admission_data$Chance.of.Admit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3400 0.6400 0.7300 0.7244 0.8300 0.9700
summary(admission_data[, c("TOEFL.Score", "Chance.of.Admit")])
## TOEFL.Score Chance.of.Admit
## Min. : 92.0 Min. :0.3400
## 1st Qu.:103.0 1st Qu.:0.6400
## Median :107.0 Median :0.7300
## Mean :107.4 Mean :0.7244
## 3rd Qu.:112.0 3rd Qu.:0.8300
## Max. :120.0 Max. :0.9700
The mean TOEFL Score (107.4) is slightly higher than the median (107), which suggests that a few high-scoring students pull the average up. The standard deviation of TOEFL Scores is about 6, meaning that scores typically deviate from the mean by about 6 points; this indicates moderate variability. The mean Chance of Admission (0.7244) is slightly lower than the median (0.73), which suggests that a small number of applicants with low probabilities pull the average down. These univariate summaries describe each variable separately; the positive relationship between TOEFL Score and Chance of Admission, in which higher scores are associated with higher chances of admission, is examined in the scatterplots and regression that follow.
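The standard deviation of TOEFL Score is quoted in the text, but no `sd()` call for it appears among the outputs shown. A minimal sketch of how the spread measures could be computed for several columns at once is given below. The data frame `toy` and its values are hypothetical stand-ins, not the admissions data, and `spread_summary` is an illustrative name.

```r
# Hypothetical stand-in data; NOT the admissions dataset
toy <- data.frame(
  TOEFL.Score = c(100, 104, 108, 112),
  GRE.Score   = c(300, 310, 320, 330)
)

# Mean, median, standard deviation and IQR for each column
spread_summary <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
})

round(spread_summary, 2)
```

With the real data, `toy` would be replaced by something like `admission_data[, c("TOEFL.Score", "GRE.Score")]`.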
# Summary statistics for GRE Scores
summary(admission_data$GRE.Score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290.0 308.0 317.0 316.8 325.0 340.0
summary(admission_data$Chance.of.Admit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3400 0.6400 0.7300 0.7244 0.8300 0.9700
summary(admission_data[, c("GRE.Score", "Chance.of.Admit")])
## GRE.Score Chance.of.Admit
## Min. :290.0 Min. :0.3400
## 1st Qu.:308.0 1st Qu.:0.6400
## Median :317.0 Median :0.7300
## Mean :316.8 Mean :0.7244
## 3rd Qu.:325.0 3rd Qu.:0.8300
## Max. :340.0 Max. :0.9700
For GRE Score, the median is 317 and the mean is 316.8, indicating a roughly symmetric distribution. The minimum is 290 and the maximum is 340, the lowest and highest scores in the dataset. The middle half of the scores spans 17 points (308 to 325), so GRE scores cover a much wider numeric range than the compact 0 to 1 scale of admission chances. For Chance of Admission (COA), the median is 0.73, meaning that 50% of students have an admission probability below 0.73; because the mean (0.7244) is very close to the median, the distribution is fairly balanced. The minimum is 0.34 and the maximum is 0.97, the lowest and highest admission probabilities in the dataset. The positive relationship between GRE Score and COA is assessed later with the scatterplot and regression.
# Summary statistics for SOP
summary(admission_data$SOP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 2.5 3.5 3.4 4.0 5.0
summary(admission_data$Chance.of.Admit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3400 0.6400 0.7300 0.7244 0.8300 0.9700
summary(admission_data[, c("SOP", "Chance.of.Admit")])
## SOP Chance.of.Admit
## Min. :1.0 Min. :0.3400
## 1st Qu.:2.5 1st Qu.:0.6400
## Median :3.5 Median :0.7300
## Mean :3.4 Mean :0.7244
## 3rd Qu.:4.0 3rd Qu.:0.8300
## Max. :5.0 Max. :0.9700
The summary statistics show that SOP has a mean of 3.4 and a median of 3.5. The minimum is 1.0 and the maximum is 5.0, so the range is 4.0. Q1 lies 1.0 below the median, whereas Q3 lies only 0.5 above it; because Q1 is farther from the median than Q3, the SOP data is left-skewed. SOP and Chance of Admission are measured on different scales (1 to 5 versus 0 to 1), so their minimums of 1.0 and 0.34 are not directly comparable. For both SOP and Chance of Admission, the mean is close to but slightly below the median, which suggests that the skew is mild and the distributions are approximately symmetric.
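The quartile comparison described above can be expressed as a single number, the Bowley (quartile) skewness. This measure is not used in the original analysis; the sketch below only reuses the SOP quartiles printed in the summary output, with illustrative variable names.

```r
# Quartiles of SOP copied from the summary output above
q1  <- 2.5
med <- 3.5
q3  <- 4.0

# Bowley skewness: (distance above the median - distance below) / IQR.
# Negative values mean Q1 is farther from the median than Q3 (left skew).
bowley <- ((q3 - med) - (med - q1)) / (q3 - q1)
bowley
```

A negative result agrees with the left skew inferred from the quartile distances.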
# Histogram for CGPA
hist(admission_data$CGPA,
     breaks=12,
     col="plum",
     main="Distribution of CGPA",
     xlab="CGPA",
     border="black",
     prob=TRUE)
lines(density(admission_data$CGPA), col="red", lwd=2)
# Histogram for Chance of Admission
hist(admission_data$Chance.of.Admit,
     breaks=10,
     col="lightgreen",
     main="Distribution of Chance of Admission",
     xlab="Chance of Admission",
     border="black",
     prob=TRUE)
lines(density(admission_data$Chance.of.Admit), col="blue", lwd=2)
The histogram of CGPA is roughly bell-shaped, which means that most students have a CGPA near the mean of 8.60 (median = 8.61). The spread is compact: half of the students fall between the first quartile (8.17) and the third quartile (9.06). On the other hand, the histogram for Chance of Admission is slightly left-skewed, with a longer tail toward low values. This means that there is a small number of students who have a very low chance of gaining admission. Most students are concentrated around the median of 0.73, so they have a moderate to high chance of admission into graduate programs.
# Histogram & Fitted Distributions for TOEFL Score
hist(admission_data$TOEFL.Score,
     breaks=seq(90, 120, by=2),
     col="skyblue",
     main="Distribution of TOEFL Score",
     xlab="TOEFL Score",
     border="black",
     xlim=c(90,121),
     prob=TRUE,
     xaxt="n")
x_ticks <- seq(90, 120, by=2)
axis(1, at=x_ticks, labels = x_ticks)
lines(density(admission_data$TOEFL.Score), col="red", lwd=2)
# Histogram & Fitted Distribution for Chance of Admission
hist(admission_data$Chance.of.Admit,
     breaks=seq(0,1, by=0.05),
     xlim=c(0,1),
     col="red",
     main="Distribution of Chance of Admission",
     xlab="Chance of Admission",
     border="black",
     prob=TRUE)
lines(density(admission_data$Chance.of.Admit), col="blue", lwd=2)
The histogram of TOEFL Scores displays the distribution of the predictor variable x = TOEFL Score. TOEFL is scored on a scale from 0 to 120; in this dataset the scores run from a minimum of 92 to a maximum of 120 and are approximately normally distributed. The density curve follows a similar shape to the histogram. Out of 400 applicants, the most frequent scores are 106 and 110, and the average score is 107, which suggests that the typical applicant is competitive. The histogram of Chance of Admission displays a slightly left-skewed distribution, meaning that most of the data points are concentrated on the right side. The density is greatest for chances above 0.70.
# Histogram & Fitted Distribution for GRE Score
hist(admission_data$GRE.Score,
     breaks=20,
     col="blue3",
     main="Distribution of GRE Scores",
     xlab="GRE Score",
     border="black",
     prob=TRUE)
lines(density(admission_data$GRE.Score), col="red", lwd=2)
# Histogram & Fitted Distribution for Chance of Admission
hist(admission_data$Chance.of.Admit,
     breaks=10,
     col="red2",
     main="Distribution of Chance of Admission",
     xlab="Chance of Admission",
     border="black",
     prob=TRUE)
lines(density(admission_data$Chance.of.Admit), col="blue", lwd=2)
The histogram displays the distribution of the predictor variable, GRE Score. The GRE scale runs from 260 to 340; in this dataset the minimum score is 290 and the maximum is 340, and the data is approximately normally distributed. The density curve depicts a similar shape. The most common scores fall between 310 and 325. The histogram for COA shows the distribution of applicants’ chances of being accepted. It is left-skewed, which suggests that most applicants have a relatively high probability of admission. Because the density is highest above 0.70, the applicants’ profiles are generally competitive.
# Histogram & Fitted Distribution for SOP
hist(admission_data$SOP,
     breaks=10,
     col="lightblue",
     main="Distribution of SOP",
     xlab="SOP",
     border="black",
     prob=TRUE)
lines(density(admission_data$SOP), col="red", lwd=2)
# Histogram & Fitted Distribution for Chance of Admission
hist(admission_data$Chance.of.Admit,
     breaks=10,
     col="red",
     main="Distribution of Chance of Admission",
     xlab="Chance of Admission",
     border="black",
     prob=TRUE)
lines(density(admission_data$Chance.of.Admit), col="blue", lwd=2)
The SOP histogram has a smooth density curve; however, it is slightly left-skewed, with its highest points between 3.0 and 4.0. The curve rises more steeply between 1.0 and 3.0, so most values fall close to the mean (3.4) and median (3.5), with few outliers. The Chance of Admission histogram is also slightly left-skewed, but its curve is less smooth. It alternates between steep and flat stretches, suggesting central peaks with a gradual spread. The Chance of Admission curve is steeper than the SOP curve, implying that Chance of Admission has more data points concentrated around a specific value (0.73).
# Quantile-Quantile plots of CGPA & Chance of Admission
# In qqnorm(), the x-axis shows theoretical normal quantiles and the y-axis shows the sample values
qqnorm(admission_data$CGPA, main = "Q-Q Plot CGPA", ylab = "CGPA", xlab = "Theoretical Quantiles")
qqline(admission_data$CGPA, col="red")
qqnorm(admission_data$Chance.of.Admit, main = "Q-Q Plot Chance of Admission", ylab = "Chance of Admission", xlab = "Theoretical Quantiles")
qqline(admission_data$Chance.of.Admit, col="red")
In the Q-Q plot of CGPA, the points follow the reference line closely but drift away from it at the ends of both tails. This suggests that a few students have a very high or very low CGPA relative to a normal distribution. The Q-Q plot for Chance of Admission also follows the reference line closely, with fewer points drifting off (there is one extreme outlier at the bottom).
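To make explicit what a normal Q-Q plot places on each axis, the sketch below extracts the plotted coordinates from `qqnorm()` without drawing anything. The data is simulated purely for illustration (it is not the CGPA column), and the mean and standard deviation passed to `rnorm()` are arbitrary assumptions.

```r
set.seed(1)                           # simulated data, for illustration only
x <- rnorm(50, mean = 8.6, sd = 0.6)  # hypothetical values, not the real CGPA column

# plot.it = FALSE returns the plotted coordinates instead of drawing them
qq <- qqnorm(x, plot.it = FALSE)

# Horizontal axis: theoretical standard normal quantiles
theoretical <- qnorm(ppoints(length(x)))
all.equal(sort(qq$x), theoretical)

# Vertical axis: the sample values themselves
all.equal(sort(qq$y), sort(x))
```

Points that fall on the reference line drawn by `qqline()` are therefore sample values that match what a normal distribution would produce at the same quantile.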
# Boxplot of CGPA & Chance of Admission
# The vertical axis of a boxplot shows the variable's values, not counts
boxplot(admission_data$CGPA, main="Boxplot of CGPA", col="plum", border="black", ylab="CGPA")
boxplot(admission_data$Chance.of.Admit, main="Boxplot of Chance of Admission", col="lightgreen", border="black", ylab="Chance of Admission")
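In the boxplot of CGPA, the median line sits only marginally below the middle of the box (Q1 = 8.17, median = 8.61, Q3 = 9.06), so the distribution is close to symmetric. There are no major outliers, and the values are spread over a moderate range, which once again confirms that they are mostly centered around the median. The boxplot of Chance of Admission also has its median line slightly below the middle of the box (Q1 = 0.64, median = 0.73, Q3 = 0.83); however, the values extend farther below the box than above it (the minimum of 0.34 is 0.30 below Q1, while the maximum of 0.97 is only 0.14 above Q3), which is consistent with the mild left skew seen in the histogram. There is one outlier below the lower whisker, which indicates at least one student with a very low chance of admission.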
# Quantile-Quantile plot for TOEFL Score
qqnorm(admission_data$TOEFL.Score,
       main = "QQ Plot of TOEFL Scores",
       xlab = "Theoretical Quantiles",
       ylab = "Scores")
qqline(admission_data$TOEFL.Score, col="blue")
# Quantile-Quantile plot of Chance of Admit
qqnorm(admission_data$Chance.of.Admit,
       main = "QQ Plot of Chance of Admission",
       xlab = "Theoretical Quantiles",
       ylab = "Chance of Admission")
qqline(admission_data$Chance.of.Admit, col="blue")
The QQ Plot for TOEFL Scores shows that most data points follow the reference line in the central region of scores from 100 to 115, which is consistent with a normal distribution. The plot also shows some points above the QQ line in the lower theoretical quantiles (-3 to -1); these scores are higher than a normal distribution would predict. The QQ Plot for Chance of Admission shows that most data points follow the reference line in the central region between 0.60 and 0.90. The plot also shows some points below the QQ line in the lower theoretical quantiles (-3 to -1); these values are lower than a normal distribution would predict.
# Boxplot of TOEFL Score
boxplot(admission_data$TOEFL.Score,
        main="Boxplot of TOEFL Score",
        xlab="TOEFL",
        ylab="Scores",
        col="skyblue",
        border="black",
        ylim=c(90,120))
# Boxplot of Chance of Admission
boxplot(admission_data$Chance.of.Admit,
        main="Boxplot of Chance of Admission",
        ylab="Chance of Admission",  # the vertical axis shows values, not counts
        col="red",
        border="black",
        ylim=c(0,1))
The boxplot for TOEFL Scores indicates a median score of 107. The first quartile of 103 means that 25% of scores are below this value, and the third quartile of 112 means that 75% of scores are below this value. The interquartile range (IQR) is 9, which indicates some variability; however, the data points are relatively close together and clustered around the median. The boxplot for Chance of Admission indicates a median value of 0.73. The first quartile of 0.64 means that 25% of students have a chance of admission below this value, and the third quartile of 0.83 means that 75% of students have a chance of admission below this value. The IQR is 0.19, which is a wider spread relative to the median than for TOEFL Scores.
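The IQR also determines which points a boxplot draws as outliers. A minimal sketch of the usual 1.5 × IQR rule (Tukey fences) follows, using only the quartiles and extremes printed in the summary outputs. The helper `fences()` is a hypothetical name, and `boxplot()` itself works from hinges that can differ slightly from these `summary()` quartiles, so this is an approximation of what the plots show.

```r
# Quartiles and extremes copied from the summary outputs above
toefl <- c(min = 92, q1 = 103, q3 = 112, max = 120)
coa   <- c(min = 0.34, q1 = 0.64, q3 = 0.83, max = 0.97)

# Tukey fences: points beyond 1.5 * IQR from the box are flagged as outliers
fences <- function(s) {
  iqr <- unname(s["q3"] - s["q1"])
  c(iqr   = iqr,
    lower = unname(s["q1"]) - 1.5 * iqr,
    upper = unname(s["q3"]) + 1.5 * iqr)
}

fences(toefl)  # IQR 9, fences at 89.5 and 125.5
fences(coa)    # IQR 0.19, fences at 0.355 and 1.115
```

Under this rule the TOEFL minimum and maximum both lie inside the fences, while the Chance of Admission minimum of 0.34 falls just below the lower fence, so it would be drawn as a low outlier.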
# Quantile-Quantile plots of GRE Score & Chance of Admission
qqnorm(admission_data$GRE.Score, main = "QQ Plot of GRE Scores",
       xlab = "Theoretical Quantiles",
       ylab = "Scores")
qqline(admission_data$GRE.Score, col="blue3")
qqnorm(admission_data$Chance.of.Admit, main = "QQ Plot of Chance of Admission",
       xlab = "Theoretical Quantiles",
       ylab = "Chance of Admission")
qqline(admission_data$Chance.of.Admit, col="blue3")
The QQ plot for GRE Scores shows how well the data aligns with a normal distribution. Scores in the central range, roughly 300 to 320, are expected to lie close to the reference line, with deviations at the extremes (the lower and upper quantiles). Minor deviations there indicate scores higher or lower than a normal distribution would predict, but an otherwise close fit supports treating GRE Score as approximately normal. The QQ plot for COA similarly indicates whether admission chances are normally distributed. Central values around 0.60 to 0.90 align well with the QQ line, with points falling below it at the lowest admission chances. These points indicate that some students have lower chances of admission than a normal distribution would predict.
# Boxplot of GRE Score & Chance of Admission
boxplot(admission_data$GRE.Score, main = "Boxplot of GRE Scores", xlab="GRE", ylab="Scores", col="skyblue3")
boxplot(admission_data$Chance.of.Admit, main = "Boxplot of Chance of Admission", ylab="Chance of Admission", col="green3")
The GRE boxplot displays a median GRE Score of 317. The first quartile is 308, so the lowest 25% of scores fall below it, and the third quartile is 325. The interquartile range (IQR) of 17 shows that GRE scores are relatively clustered but with some variability. The COA boxplot shows the distribution of admission chances. The median is 0.73 and the IQR is 0.19 (0.64 to 0.83). Relative to its median, this spread is wider than that of GRE Score, which may reflect the influence of factors other than test scores on admission chances.
# Quantile Plots for SOP
qqnorm(admission_data$SOP, main = "QQ Plot of SOP",
       xlab = "Theoretical Quantiles",
       ylab = "Values")
qqline(admission_data$SOP, col="blue2")
# Quantile Plots for Chance of Admission
qqnorm(admission_data$Chance.of.Admit, main = "QQ Plot of Chance of Admission",
       xlab = "Theoretical Quantiles",
       ylab = "Chance of Admission")
qqline(admission_data$Chance.of.Admit, col="blue2")
Note first that SOP takes only whole and half values, so its Q-Q plot looks visually different from the Chance of Admission Q-Q plot: the points form horizontal steps. The SOP Q-Q plot has more points to the left of the reference line, again suggesting that the data is left-skewed. The Chance of Admission Q-Q plot follows the reference line fairly closely, meaning that the data is well approximated by a normal distribution. The only exceptions are the few points that drift off at the top and bottom, suggesting a couple of outliers in the dataset.
# Boxplot for SOP
boxplot(admission_data$SOP, main="Boxplot of SOP", xlab="SOP", ylab="Values", col="lightblue")
# Boxplot for Chance of Admission
boxplot(admission_data$Chance.of.Admit, main="Boxplot of Chance of Admission", ylab="Chance of Admission", col="red")
The SOP box spans roughly 2.5 to 4.0, which is where the middle half of the data lies. The box sits in the middle of the graph, and the median line is in the upper half of the box, suggesting that the data is slightly left-skewed. In contrast, the Chance of Admission boxplot has a median line very close to the middle of the box, meaning that the central part of the data is roughly symmetric. The Chance of Admission boxplot also shows an outlier below the lower whisker.
# Scatterplot matrix of all variables and response variable
pairs(admission_data[, c("CGPA", "GRE.Score", "TOEFL.Score", "SOP", "Chance.of.Admit")])
The scatterplot of CGPA against Chance of Admission shows that as CGPA increases, the chance of gaining admission also rises. The scatterplot of TOEFL Score against Chance of Admission reveals a positive correlation between the two. The scatterplot of GRE Score against COA shows that higher GRE scores are associated with higher admission chances. The scatterplot of SOP against COA shows a slight positive correlation; however, it is weak because the points are widely spread, meaning that changes in one variable do not reliably predict changes in the other.
# Scatterplot of CGPA vs Chance of Admission
scatterplot(admission_data$CGPA, admission_data$Chance.of.Admit,
            main = "Scatter Plot of CGPA vs Chance of Admission",
            xlab = "CGPA",
            ylab = "Chance of Admission",
            col="plum3")
There is a positive linear relationship between CGPA and Chance of Admit: as CGPA increases, the chance of gaining admission into graduate school increases. Above a CGPA of 9.0, the points lie closer to the trend line, which indicates a more consistent relationship between high CGPA and admission chances. Below 8.0, the points are farther from the trend line, which indicates more variability among students with lower CGPA. This tells us that CGPA predicts the chance of admission more reliably for students with a high CGPA than for students with a low CGPA, whose chances may be more strongly influenced by other factors. There is also an outlier at a Chance of Admit of 0.34.
# Scatterplot of TOEFL Score vs Chance of Admit
scatterplot(admission_data$TOEFL.Score, admission_data$Chance.of.Admit,
            main = "Scatter Plot of TOEFL Score vs Chance of Admission",
            xlab = "TOEFL Score",
            ylab = "Chance of Admission",
            col="steelblue1")
The scatterplot shows that the median TOEFL Score of 107 aligns with the median Chance of Admission of 0.73. There is a positive relationship between TOEFL Score and Chance of Admission: as TOEFL Scores increase, the Chance of Admission increases as well. There is a tight cluster of data points around the line; however, there is some dispersion, and a few outliers show that some students have a high TOEFL score but a low chance of admission. This suggests that factors other than TOEFL scores play a role in the admissions decision process. Because outliers can affect regression results, a transformation may be needed.
# Scatterplot of GRE Score vs Chance of Admit
scatterplot(admission_data$GRE.Score, admission_data$Chance.of.Admit,
            main = "Scatter Plot of GRE Score vs Chance of Admission",
            xlab = "GRE Score",
            ylab = "Chance of Admission",
            col="skyblue3")
The scatterplot of GRE Score against COA shows a positive trend, with points moving from lower left to upper right, suggesting a strong relationship. Tight clustering along the line supports GRE Score as a predictor, though some dispersed points reveal cases where high GRE scores do not come with higher admission chances, possibly indicating other influential factors in the admissions process. If the outliers significantly affect the regression results, a transformation could help to refine the model.
# Scatterplot of SOP vs Chance of Admit
scatterplot(admission_data$SOP, admission_data$Chance.of.Admit,
            main = "Scatter Plot of SOP vs Chance of Admission",
            xlab = "SOP",
            ylab = "Chance of Admission",
            col="lightblue3")
Although somewhat weak, there is a positive linear relationship between SOP and Chance of Admission: as the SOP score increases, so does the chance of admission. However, both variables show quite a few outlying points, with SOP showing noticeably more. This could suggest that the two variables are not strongly correlated or reliant on one another. The plot and its fitted curves still imply that the data is slightly left-skewed. Because the skew is to the left, a logarithmic or square-root transformation applied directly would not help, since those transformations correct right skew. A power transformation with an exponent above 1, or a reflection of the variable followed by a logarithmic or square-root transformation, may make the distribution more symmetric and closer to normal.
# Box-Cox check for CGPA: use powerTransform to test whether a power transformation toward normality is needed
p1 <- powerTransform(CGPA ~ 1, data=admission_data, family="bcPower")
summary(p1)
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 1.3198 1 0.0494 2.5901
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 4.206439 1 0.040271
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 0.2442696 1 0.62114
According to the results, the estimated lambda is 1.3198, with a Wald lower bound of 0.0494 and an upper bound of 2.5901. Because this interval contains 1, and the rounded power is 1, the estimate is consistent with leaving the variable untransformed. The first likelihood ratio (LR) test checks whether lambda equals 0, which corresponds to a log transformation. Its p-value is 0.0403, below the 0.05 (5%) threshold for statistical significance, so we reject lambda = 0: a log transformation is not supported. The second LR test checks whether lambda equals 1, which corresponds to no transformation. Its p-value of 0.6211 is well above the 0.05 threshold, so we fail to reject lambda = 1. Together, the results indicate that a transformation is not necessary.
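The p-values in the powerTransform summary for CGPA come from comparing each likelihood ratio statistic with a chi-squared distribution on 1 degree of freedom. As a rough check, and assuming that is the reference distribution used (it matches the `df` column in the output), the sketch below recomputes them from the printed LRT values. The variable names are illustrative.

```r
# Likelihood ratio statistics copied from the powerTransform summary for CGPA
lrt_log  <- 4.206439   # H0: lambda = 0 (log transformation)
lrt_none <- 0.2442696  # H0: lambda = 1 (no transformation)

# Upper-tail chi-squared probability with 1 degree of freedom
p_log  <- pchisq(lrt_log,  df = 1, lower.tail = FALSE)
p_none <- pchisq(lrt_none, df = 1, lower.tail = FALSE)

round(c(p_log = p_log, p_none = p_none), 5)
```

A p-value below 0.05 in the first test rejects the log transformation, while a large p-value in the second test means the untransformed variable is not rejected.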
# Box-Cox check for TOEFL Score: use powerTransform to test whether a power transformation toward normality is needed
p1 <- powerTransform(TOEFL.Score ~ 1, data=admission_data, family="bcPower")
summary(p1)
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 0.4826 1 -1.1232 2.0883
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 0.3476257 1 0.55546
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 0.3980178 1 0.52811
The results show an estimated power of 0.4826, with a Wald interval from -1.1232 to 2.0883 and a rounded power of 1, which indicates no meaningful deviation from the original untransformed variable. For the LR test of lambda equal to 0, the p-value is 0.555, which is greater than 5%. For the LR test of lambda equal to 1, the p-value is 0.528, which is also greater than 5%. Because both p-values are high, we fail to reject either null hypothesis: the data is compatible with both a log transformation and no transformation. Since the untransformed variable is not rejected, no transformation is needed for the TOEFL Score variable.
# Box-Cox check for GRE Score: use powerTransform to test whether a power transformation toward normality is needed
p1 <- powerTransform(GRE.Score ~ 1, data = admission_data, family = "bcPower")
summary(p1)
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 1.687 1 -0.9103 4.2843
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 1.621962 1 0.20282
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 0.2688866 1 0.60408
The estimated power is 1.687, but the Wald interval is wide, with a lower bound of -0.9103 and an upper bound of 4.2843. Because this interval contains 1 and the rounded power is 1, no transformation appears necessary. For the test of lambda equal to 0, the LR statistic is 1.621962 with a p-value of 0.20282, so a log transformation is not rejected. For the test of lambda equal to 1, the LR statistic is 0.2688866 with a p-value of 0.60408, so the untransformed variable is not rejected either, reinforcing that a transformation is not necessary.
# Box-Cox check for SOP: use powerTransform to test whether a power transformation toward normality is needed
p1 <- powerTransform(SOP ~ 1, data=admission_data, family="bcPower")
summary(p1)
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 1.2472 1 0.9574 1.537
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 83.44751 1 < 2.22e-16
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 2.8828 1 0.08953
In the likelihood ratio test (LRT) of lambda equal to 0, the LRT statistic is 83.44751. Such a large value is strong evidence against the null hypothesis, and the p-value is correspondingly small (below 2.22e-16), so we can confidently reject the null hypothesis. This means that a log transformation is not appropriate for SOP. In the LRT of lambda equal to 1, we do not have strong evidence to reject the null hypothesis: the LRT statistic is 2.8828, much smaller than in the first test, and the p-value of 0.08953 is close to 0.05 but still above it. The estimated power of 1.2472 has a Wald interval from 0.9574 to 1.537, which contains 1. Together, these results suggest that a transformation may not be necessary.
# Model 1: Simple linear regression for CGPA
model1 <- lm(Chance.of.Admit ~ CGPA, data=admission_data)
summary(model1)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA, data = admission_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.274575 -0.030084 0.009443 0.041954 0.180734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.07151 0.05034 -21.29 <2e-16 ***
## CGPA 0.20885 0.00584 35.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06957 on 398 degrees of freedom
## Multiple R-squared: 0.7626, Adjusted R-squared: 0.762
## F-statistic: 1279 on 1 and 398 DF, p-value: < 2.2e-16
confint(model1)
## 2.5 % 97.5 %
## (Intercept) -1.1704792 -0.9725442
## CGPA 0.1973654 0.2203290
The model shows that when CGPA increases by 1 point, the predicted chance of admission rises by 0.209, or about 20.9 percentage points. As CGPA improves, so does the likelihood of gaining admission. The relationship is statistically significant, since the p-value is below 0.001, and the effect is also economically meaningful. The intercept is -1.072, which would be the predicted Chance of Admission at a CGPA of 0; because a CGPA of 0 lies far outside the observed range (6.80 to 9.92) and the value is not a valid probability, the intercept has no practical interpretation. The R-squared value of 0.7626 means that about 76% of the variation in admission chances is explained by CGPA, which indicates a strong predictive relationship. The 95% confidence interval (from the 2.5% to the 97.5% bound) indicates that a 1-point increase in CGPA is associated with an increase in the chance of admission of between 0.197 and 0.220. The residual standard error is small, at 0.06957, meaning that the fitted values are typically close to the observed values.
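The t value and the confidence interval for the CGPA slope can be reproduced by hand from the estimate and standard error in the regression table. The sketch below uses the rounded values printed by `summary(model1)`, so the results agree with `confint(model1)` only up to rounding; the variable names are illustrative.

```r
# Slope estimate, standard error and residual df copied from summary(model1)
est      <- 0.20885
se       <- 0.00584
df_resid <- 398

# The t statistic is the estimate divided by its standard error
t_value <- est / se

# 95% interval: estimate +/- critical t value times the standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)

round(t_value, 2)
round(ci, 4)
```

The interval describes the uncertainty in the slope itself, that is, in the expected change in Chance of Admit for a 1-point change in CGPA.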
# Model 2: Simple linear regression for TOEFL Score
model2 <- lm(Chance.of.Admit ~ TOEFL.Score, data=admission_data)
summary(model2)
##
## Call:
## lm(formula = Chance.of.Admit ~ TOEFL.Score, data = admission_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31252 -0.05128 0.01328 0.05453 0.21067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2734005 0.0774217 -16.45 <2e-16 ***
## TOEFL.Score 0.0185993 0.0007197 25.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08725 on 398 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6257
## F-statistic: 667.9 on 1 and 398 DF, p-value: < 2.2e-16
confint(model2)
## 2.5 % 97.5 %
## (Intercept) -1.42560706 -1.12119388
## TOEFL.Score 0.01718449 0.02001411
The p-values for both the intercept and TOEFL Score are less than 5%, which indicates that they are statistically significant. Because the p-values are low, we have strong evidence against the null hypothesis that each coefficient equals zero. TOEFL Score is therefore a statistically significant predictor of the chance of admission. The intercept estimate is -1.27, the predicted value at a TOEFL score of 0; this lies far outside the observed range of scores and is not a valid probability, so it should not be interpreted literally. The TOEFL Score estimate is 0.0186, which means that for every one-point increase in TOEFL score, the predicted chance of admission increases by 0.0186, or 1.86 percentage points. This suggests that improving TOEFL scores is associated with a meaningfully higher chance of admission. The result is economically significant: applicants have an incentive to invest time and resources in improving their TOEFL scores.
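To see what the Model 2 coefficients imply inside the range of scores actually observed, the fitted line can be evaluated at the minimum, median and maximum TOEFL scores from the summary statistics. This is a sketch using the printed coefficients; the variable names are illustrative.

```r
# Coefficients copied from summary(model2)
b0 <- -1.2734005
b1 <- 0.0185993

# Observed minimum, median and maximum TOEFL scores from the summary output
toefl <- c(min = 92, median = 107, max = 120)

# Fitted chance of admission at each of those scores
fitted_coa <- b0 + b1 * toefl

round(fitted_coa, 3)
```

Within the observed score range the fitted values stay between 0 and 1, whereas the intercept corresponds to a TOEFL score of 0, far outside that range.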
# Model 3: Simple linear regression for GRE Score
model3 <- lm(Chance.of.Admit ~ GRE.Score, data = admission_data)
summary(model3)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score, data = admission_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33613 -0.04604 0.00408 0.05644 0.18339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4360842 0.1178141 -20.68 <2e-16 ***
## GRE.Score 0.0099759 0.0003716 26.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08517 on 398 degrees of freedom
## Multiple R-squared: 0.6442, Adjusted R-squared: 0.6433
## F-statistic: 720.6 on 1 and 398 DF, p-value: < 2.2e-16
confint(model3)
## 2.5 % 97.5 %
## (Intercept) -2.667700007 -2.2044685
## GRE.Score 0.009245267 0.0107065
The multiple R-squared value of 0.6442 indicates that around 64.42% of the variability in the Chance of Admit is explained by GRE score, suggesting a fairly strong linear relationship. The intercept of -2.4361 is the predicted chance of admission at a GRE score of 0; since that value is negative and far outside the range of realistic GRE scores, it has no practical interpretation. For each additional point in GRE score, the chance of admission is estimated to increase by 0.00998 (about 1 percentage point), so higher GRE scores are associated with a greater chance of admission. The GRE score coefficient has a very small p-value (<2e-16), which confirms that it is a statistically significant predictor of admission chances. The residuals show a slight left skew, with a minimum of -0.336 and a maximum of 0.183, suggesting that although the model fits well overall, it noticeably overpredicts the chance of admission for a few observations. The 95% confidence interval for the GRE score coefficient (0.00925 to 0.01071) is narrow, showing high precision in estimating the effect of GRE score on admission.
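The interval printed by `confint(model3)` can be reproduced by hand from the estimate and standard error, which makes clear where the 2.5% and 97.5% bounds come from. This is a sketch using the rounded values printed in the summary table, so the result matches the reported interval only up to rounding.

```r
# GRE.Score estimate, standard error, and residual degrees of freedom
# as printed by summary(model3)
estimate_gre <- 0.0099759
std_error_gre <- 0.0003716
df_resid <- 398

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus or minus t_crit times the standard error
ci_gre <- estimate_gre + c(-1, 1) * t_crit * std_error_gre
print(ci_gre)
```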
# Model 4: Simple linear regression for SOP
model4 <- lm(Chance.of.Admit ~ SOP, data=admission_data)
summary(model4)
##
## Call:
## lm(formula = Chance.of.Admit ~ SOP, data = admission_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49748 -0.05392 0.01823 0.07037 0.22393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.398942 0.018556 21.50 <2e-16 ***
## SOP 0.095708 0.005233 18.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1053 on 398 degrees of freedom
## Multiple R-squared: 0.4566, Adjusted R-squared: 0.4552
## F-statistic: 334.4 on 1 and 398 DF, p-value: < 2.2e-16
confint(model4)
## 2.5 % 97.5 %
## (Intercept) 0.36246242 0.4354213
## SOP 0.08541962 0.1059969
The intercept in this model is 0.3989, the predicted chance of admission when SOP is 0; it serves as the baseline against which changes in SOP are measured. The SOP coefficient is 0.0957, meaning that a one-unit increase in SOP rating is associated with an expected increase of 0.0957 (9.57 percentage points) in the chance of admission. Both estimates are statistically significant, with p-values below 2e-16. The SOP coefficient is also economically significant, because a substantial improvement in SOP rating corresponds to a substantial increase in the predicted chance of admission. The R-squared of this model is 0.4566, meaning that 45.66% of the variance in chance of admission is explained by SOP, which is moderate. SOP is therefore important, but it is not the only substantial factor in the chance of admission. The 95% confidence interval for the SOP coefficient runs from 0.0854 to 0.1060.
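In a regression with a single predictor, the t value, the F-statistic, and R-squared are tied together, so the numbers in the Model 4 summary can be cross-checked against each other. The sketch below uses the rounded values from the output above; small gaps are due to that rounding.

```r
# Values as printed by summary(model4)
t_sop <- 18.29
f_stat <- 334.4
df_resid <- 398

# With one predictor, the F-statistic is the square of the slope's t value
t_squared <- t_sop^2

# R-squared can be recovered from F and the residual degrees of freedom
r_squared_from_f <- f_stat / (f_stat + df_resid)

print(t_squared)
print(r_squared_from_f)
```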
Model 1 has the highest R-squared of the four models, 0.7787, which makes CGPA the strongest single predictor (explaining about 78% of the variation in the Chance of Admit). Because this R-squared exceeds that of every other model, CGPA captures more of the information needed to predict the Chance of Admit than any other single variable. The model's p-value is also below 0.001, which indicates statistical significance. The 95% confidence interval for the intercept tells us that we are 95% confident the true intercept lies between -1.1274 and -0.9612. The 95% confidence interval for the CGPA coefficient is 0.1963 to 0.2156; since this range is tight around the estimate of 0.2059 and excludes 0, we can reject the hypothesis that the coefficient is 0. In other words, a 1-point increase in CGPA is associated with an increase of roughly 19.6 to 21.6 percentage points in the chance of admission. Furthermore, since the entire interval is positive, CGPA has a positive effect on admission chances.
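The comparison across the four models can be laid out side by side. The sketch below collects the R-squared values and residual standard errors reported in the text and output into a small data frame and sorts it; the data frame and column names are hypothetical, introduced only for this illustration.

```r
# Fit statistics as reported for Models 1 to 4
model_comparison <- data.frame(
  model = c("Model 1", "Model 2", "Model 3", "Model 4"),
  predictor = c("CGPA", "TOEFL.Score", "GRE.Score", "SOP"),
  r_squared = c(0.7787, 0.6266, 0.6442, 0.4566),
  resid_std_error = c(0.06647, 0.08725, 0.08517, 0.1053)
)

# Sort from highest to lowest R-squared
model_comparison <- model_comparison[order(-model_comparison$r_squared), ]
print(model_comparison, row.names = FALSE)
```

The ordering by R-squared is the reverse of the ordering by residual standard error, so both measures rank the predictors the same way.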
# Set seed for reproducibility
set.seed(123)
# Bootstrap function for intercept and slope
boot_fn <- function(data, indices) {
  d <- data[indices, ] # Resample rows with replacement
  model <- lm(Chance.of.Admit ~ CGPA, data = d)
  coef(model) # Returns c(intercept, slope)
}
# Run bootstrap for intercept and slope
results <- boot(admission_data, boot_fn, R = 1000)
summary(results)
##
## Number of bootstrap replications R = 1000
## original bootBias bootSE bootMed
## 1 -1.07151 5.1663e-04 0.047992 -1.07192
## 2 0.20885 -7.3726e-05 0.005388 0.20891
# Bootstrap function for R-squared
r_squared <- function(data, indices) {
  d <- data[indices, ] # Resample data
  model <- lm(Chance.of.Admit ~ CGPA, data = d)
  return(summary(model)$r.squared)
}
# Run bootstrap for R-squared
r2_results <- boot(data = admission_data, r_squared, R = 1000)
summary(r2_results)
## R original bootBias bootSE bootMed
## 1 1000 0.76263 0.00015207 0.023051 0.76319
# Plot bootstrap distributions for coefficients and R-squared
plot(results, index = 1) # Plot distribution of Intercept
plot(results, index = 2) # Plot distribution of Slope (CGPA coefficient)
plot(r2_results) # Plot distribution of R-squared
# Fit the preferred model on the full data
final_model <- lm(Chance.of.Admit ~ CGPA, data = admission_data)
# Plot residuals (stored under a new name so the residuals() function is not masked)
model_residuals <- residuals(final_model)
ggplot(data = data.frame(residuals = model_residuals), aes(x = residuals)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "skyblue") +
  geom_density(color = "orange") +
  ggtitle("Histogram of Residuals for the Preferred Model") +
  xlab("Residuals") +
  ylab("Density")
# QQ Plot of residuals
qqnorm(model_residuals, main = "QQ Plot of Residuals for the Preferred Model")
qqline(model_residuals, col = "red")
The bootstrap analysis using 1000 resamples supports the reliability of Model 1's original estimates. The histograms of the bootstrap distributions for the intercept and the CGPA coefficient show approximately normal, symmetric patterns clustered around the original values, -1.0443 (intercept) and 0.2059 (CGPA coefficient). The bootstrap standard errors are very small, 0.0409 for the intercept and 0.0046 for CGPA, which demonstrates that the original estimates are stable. Additionally, the R-squared value of 0.77865 has a bootstrap bias of 0.00024 and a standard error of 0.0195; because both are small, the R-squared estimate changes little from one resample to the next. The histogram of residuals is approximately normal and centered around 0, with few outliers in the errors. The Q-Q plot of the residuals also follows the reference line closely, which reinforces the idea that the residuals are approximately normally distributed.
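A bootstrap run can also be summarized as a confidence interval with `boot.ci()` from the boot package. The sketch below is illustrative only: it runs on simulated stand-in data rather than the admissions file, and the simulated ranges, noise level, and object names (`sim_data`, `sim_results`) are assumptions chosen to resemble the Model 1 fit.

```r
library(boot)

set.seed(123)

# Simulated stand-in data (NOT the admissions data); column names mirror the report
n <- 400
sim_data <- data.frame(CGPA = runif(n, 6.8, 9.9))
sim_data$Chance.of.Admit <- -1.04 + 0.206 * sim_data$CGPA + rnorm(n, sd = 0.066)

# Statistic: intercept and slope refitted on each resample
boot_fn <- function(data, indices) {
  d <- data[indices, ]
  coef(lm(Chance.of.Admit ~ CGPA, data = d))
}

sim_results <- boot(sim_data, boot_fn, R = 1000)

# Percentile 95% interval for the slope (index = 2)
slope_ci <- boot.ci(sim_results, type = "perc", index = 2)
print(slope_ci)

# Lower and upper bounds sit in columns 4 and 5 of the percent matrix
slope_bounds <- slope_ci$percent[4:5]
```

A percentile interval makes no normality assumption about the coefficient, so it is a useful companion to the interval reported by `confint()`.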
In conclusion, undergraduate GPA (CGPA) has the strongest influence on chance of admission, as it has the highest R-squared value, 0.7787. This tells us that about 77.87% of the variation in admission chances is explained by CGPA alone, indicating that it is the most impactful predictor. In our findings, each one-point increase in CGPA is associated with a 20.59 percentage point rise in admission chances, which reinforces the importance of CGPA in the graduate admissions process. GRE Score and Test of English as a Foreign Language (TOEFL) score also had a significant influence on chance of admission: GRE Score has an R-squared value of 0.6442 and TOEFL has an R-squared value of 0.6266. Statement of Purpose (SOP) had the lowest R-squared value, 0.4566, which indicates that it is the predictor with the least explanatory power for chance of admission. Although GRE Score and TOEFL are heavily influential on chance of admission, their explanatory power is smaller than that of CGPA.